@jairad26 jairad26 commented Oct 16, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • This PR adds schema support to the JS client, along with tests. It also adds logic to embed sparse vectors using embedding functions (EFs) and to use the schema for dense vectors, plus tests to ensure that serialization and deserialization work.
  • New functionality
    • ...

Test plan

How are these changes tested?

Added schema unit tests matching the Python ones.

  • [x] Tests pass locally with pytest for Python, yarn test for JS, cargo test for Rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

jairad26 commented Oct 16, 2025

@jairad26 jairad26 force-pushed the jai/schema-js-impl branch 3 times, most recently from c51fb45 to 61c4404 Compare October 16, 2025 20:28
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 16, 2025
@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch from f40f1ea to e3a6cb8 Compare October 16, 2025 20:33
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 16, 2025
@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch from e3a6cb8 to 5439079 Compare October 16, 2025 22:33
@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Oct 16, 2025
@jairad26 jairad26 force-pushed the jai/schema-js-impl branch 2 times, most recently from fe26daa to ac00bdc Compare October 16, 2025 23:35
@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch 2 times, most recently from 8bd6af6 to be61947 Compare October 17, 2025 00:40
@jairad26 jairad26 marked this pull request as ready for review October 17, 2025 00:40
@propel-code-bot

propel-code-bot bot commented Oct 17, 2025

Add Schema Support to JS Client with Dense and Sparse Vector Indexing

This PR introduces a comprehensive schema system to the JavaScript client for ChromaDB, bringing parity with Python for collection schema configuration, embedding function management, and test coverage. The changes provide a flexible and extensible schema abstraction, allowing users to programmatically define, configure, serialize, and deserialize index types (including sparse and dense vector indexes, inverted indexes, FTS, and more) for each field in a collection. The update includes end-to-end schema serialization tests, dense and sparse vector embedding integration, and logic in the core collection path to utilize schema-aware embedding function resolution and sparse auto-embedding.

Key Changes

• Introduced schema.ts with Schema class encapsulating collection field-level index configuration, including FTS, vector, inverted, and sparse vector indexes.
• Added support for registering and resolving both dense and sparse embedding functions as part of the schema, with automatic resolution for dense and sparse fields during record preparation.
• Integrated schema field configuration into the main collection API (CollectionImpl), with logic to use schema-driven embedding functions if not explicitly set.
• Implemented applySparseEmbeddingsToMetadatas in the collection path to enable automatic sparse vector embedding generation based on schema config (e.g., using #document or metadata source fields), including tests for override and fallback logic.
• Extended collection create/get/fork/list logic to serialize, deserialize, and roundtrip schema configuration to and from the API, supporting both legacy and new schemas.
• Added 1400+ lines of comprehensive unit tests (schema.test.ts) verifying schema initialization, create/delete/chained operations, serialization/deserialization, and sparse embedding edge cases.
• Refactored and updated helpers in embedding-function.ts, chroma-client.ts, and collection-configuration.ts to support the new synchronous embedding function resolution API.
• Extended index.ts to publicly export schema utilities and types.

Affected Areas

clients/new-js/packages/chromadb/src/schema.ts (new major module)
clients/new-js/packages/chromadb/src/collection.ts (collection core logic and embedding integration)
clients/new-js/packages/chromadb/src/chroma-client.ts (API roundtrip, collection factory logic)
clients/new-js/packages/chromadb/src/embedding-function.ts (embedding function registry, resolution, and new types)
clients/new-js/packages/chromadb/src/collection-configuration.ts (configuration processing)
clients/new-js/packages/chromadb/src/index.ts (public schema export)
clients/new-js/packages/chromadb/test/schema.test.ts (extensive test coverage)

This summary was automatically generated by @propel-code-bot

Comment on lines 219 to 235
data.map(async (collection) =>
  new CollectionImpl({
    chromaClient: this,
    apiClient: this.apiClient,
    name: collection.name,
    id: collection.id,
    embeddingFunction: await getEmbeddingFunction(
      collection.name,
      collection.configuration_json.embedding_function ?? undefined,
    ),
    configuration: collection.configuration_json,
    metadata: collection.metadata ?? undefined,
    schema: Schema.deserializeFromJSON(collection.schema ?? undefined),
  }),

[BestPractice]

The logic for creating a CollectionImpl instance from an API response is repeated in listCollections, createCollection, getCollection, and getOrCreateCollection. To improve maintainability and reduce code duplication, consider extracting this logic into a private helper method, for example _collectionFromResponseData(data, embeddingFunction?). This would centralize the construction of collection objects from API data.

ChromaDB Best Practice: Following the official ChromaDB JavaScript client patterns, API responses should be consistently transformed into Collection objects. A helper method ensures consistent handling of the API response structure and metadata across all collection operations.
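
The extraction suggested above could look roughly like the following. This is a minimal sketch under stated assumptions, not the client's actual code: `CollectionResponse`, `CollectionArgs`, and `collectionArgsFromResponse` are hypothetical stand-ins, and the embedding function is represented by a plain string so the example runs on its own.

```typescript
// Stand-in for the API response shape; the real response has more fields.
interface CollectionResponse {
  id: string;
  name: string;
  metadata?: Record<string, unknown> | null;
  schema?: Record<string, unknown> | null;
}

// Stand-in for the arguments a collection constructor would receive.
interface CollectionArgs {
  id: string;
  name: string;
  metadata: Record<string, unknown> | undefined;
  schema: Record<string, unknown> | undefined;
  embeddingFunction: string | undefined; // placeholder for an EmbeddingFunction
}

// Hypothetical helper: the single place where an API response is turned
// into constructor arguments. Nullish API fields become undefined.
function collectionArgsFromResponse(
  data: CollectionResponse,
  embeddingFunction?: string,
): CollectionArgs {
  return {
    id: data.id,
    name: data.name,
    metadata: data.metadata ?? undefined,
    schema: data.schema ?? undefined,
    embeddingFunction,
  };
}

// A list-style caller maps every response through the same helper.
const responses: CollectionResponse[] = [
  { id: "1", name: "a", metadata: null },
  { id: "2", name: "b", schema: { keys: {} } },
];
const listed = responses.map((item) => collectionArgsFromResponse(item));
```

With a helper of this kind, the create, get, and list paths would differ only in how they obtain the response and which embedding function they pass in.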

File: clients/new-js/packages/chromadb/src/chroma-client.ts
Line: 232

@jairad26 jairad26 force-pushed the jai/schema-e2e-tests branch from be61947 to 9322a2c Compare October 17, 2025 03:00
@@ -0,0 +1,1002 @@
import type {

[BestPractice]

The new schema.ts file is quite large and contains many distinct concepts (constants, config classes, type classes, utility functions, and the main Schema class). For better maintainability and code organization, consider splitting this file into smaller, more focused modules.

For example, you could structure it like this:

  • src/schema/constants.ts
  • src/schema/index-configs.ts
  • src/schema/value-types.ts
  • src/schema/utils.ts
  • src/schema/schema.ts (for the main class)
  • src/schema/index.ts (to re-export everything)

This would make the code easier to navigate and understand.

File: clients/new-js/packages/chromadb/src/schema.ts
Line: 1

@jairad26 jairad26 changed the base branch from jai/schema-e2e-tests to graphite-base/5621 October 17, 2025 04:21
@jairad26 jairad26 changed the base branch from main to graphite-base/5621 October 21, 2025 22:49
@jairad26 jairad26 changed the base branch from graphite-base/5621 to jai/export-python-hosted-splade-model October 21, 2025 22:50
Comment on lines +413 to +415
});

// Generate embeddings for all collected documents

[BestPractice]

Error message lacks detail: when sparseEmbeddings.length !== positions.length an error is thrown, but the expected and actual counts aren't included in the message, which makes debugging difficult.

Suggested change
- });
-
- // Generate embeddings for all collected documents
+ if (sparseEmbeddings.length !== positions.length) {
+   throw new ChromaValueError(
+     `Sparse embedding function returned unexpected number of embeddings. Expected ${positions.length}, got ${sparseEmbeddings.length}`,
+   );
+ }

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
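
To see what the more descriptive message buys, here is a small self-contained sketch. `ValueError` is a stand-in for the client's `ChromaValueError`, and `assertEmbeddingCount` and `describeMismatch` are hypothetical helpers introduced only for illustration.

```typescript
// Stand-in for ChromaValueError so the sketch runs without the client.
class ValueError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ValueError";
  }
}

// Hypothetical helper: throws when the counts differ and reports both.
function assertEmbeddingCount(expected: number, actual: number): void {
  if (actual !== expected) {
    throw new ValueError(
      `Sparse embedding function returned unexpected number of embeddings. Expected ${expected}, got ${actual}`,
    );
  }
}

// Returns "ok" when the counts match, otherwise the error message.
function describeMismatch(expected: number, actual: number): string {
  try {
    assertEmbeddingCount(expected, actual);
    return "ok";
  } catch (error) {
    return error instanceof Error ? error.message : String(error);
  }
}

const mismatchMessage = describeMismatch(3, 2);
```

Because both counts appear in the message, a log line alone is enough to tell whether the embedding function dropped inputs or returned extras.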

File: clients/new-js/packages/chromadb/src/collection.ts
Line: 415

Comment on lines +450 to +454

const sparseEmbeddings = await this.sparseEmbed(embeddingFunction, inputs, false);
if (sparseEmbeddings.length !== positions.length) {

[BestPractice]

Same issue: include actual vs expected values in error message for better debugging.

Suggested change
- const sparseEmbeddings = await this.sparseEmbed(embeddingFunction, inputs, false);
- if (sparseEmbeddings.length !== positions.length) {
+ if (sparseEmbeddings.length !== positions.length) {
+   throw new ChromaValueError(
+     `Sparse embedding function returned unexpected number of embeddings. Expected ${positions.length}, got ${sparseEmbeddings.length}`,
+   );
+ }

File: clients/new-js/packages/chromadb/src/collection.ts
Line: 452

Comment on lines +359 to +362
private async applySparseEmbeddingsToMetadatas(
  metadatas?: Metadata[],
  documents?: string[],
): Promise<Metadata[] | undefined> {

[BestPractice]

This function is quite long and contains duplicated logic for handling sourceKey === DOCUMENT_KEY and the general case where sourceKey is a metadata field. The core logic of collecting inputs, generating embeddings, and updating metadata is nearly identical in both branches.

Consider extracting this shared logic into a helper function to reduce duplication and improve readability. The helper could take parameters like targetKey, config, updatedMetadatas, and documentsList and handle the embedding process for a single sparse target. This would make applySparseEmbeddingsToMetadatas much simpler, as it would then be primarily responsible for iterating through sparseTargets and calling the helper.
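
One possible shape for such a helper is sketched below. It is an illustration under assumptions, not the PR's implementation: `embedSparseTarget`, `RecordMetadata`, `SparseVector`, and `SparseEmbed` are hypothetical names, the sparse vector shape is a placeholder, and the rule that an existing value under `targetKey` is left untouched is assumed rather than taken from the source. The `readSource` callback hides the only difference between the two branches, namely whether the text comes from the document or from a metadata field.

```typescript
type RecordMetadata = Record<string, unknown>;
// Placeholder shape for a sparse vector; the real type may differ.
type SparseVector = { indices: number[]; values: number[] };
type SparseEmbed = (inputs: string[]) => Promise<SparseVector[]>;

// Hypothetical per-target helper: collect inputs, embed them in one call,
// then write each result back onto the record it came from.
async function embedSparseTarget(
  targetKey: string,
  readSource: (index: number) => unknown,
  metadatas: RecordMetadata[],
  embed: SparseEmbed,
): Promise<void> {
  const inputs: string[] = [];
  const targets: RecordMetadata[] = [];
  metadatas.forEach((metadata, index) => {
    const source = readSource(index);
    // Assumption: skip records without text or with a caller-supplied value.
    if (typeof source === "string" && !(targetKey in metadata)) {
      inputs.push(source);
      targets.push(metadata);
    }
  });
  if (inputs.length === 0) {
    return;
  }

  const embeddings = await embed(inputs);
  if (embeddings.length !== targets.length) {
    throw new Error(`Expected ${targets.length} sparse embeddings, got ${embeddings.length}`);
  }
  targets.forEach((metadata, i) => {
    metadata[targetKey] = embeddings[i];
  });
}

const sampleDocuments = ["hello world", "sparse vectors"];
const sampleMetadatas: RecordMetadata[] = [{}, { sparse: "kept" }];
// Toy embedder: a single index whose value is the input length.
const toyEmbed: SparseEmbed = async (inputs) =>
  inputs.map((text) => ({ indices: [0], values: [text.length] }));

// Document-sourced target; a metadata-sourced one would pass
// (i) => sampleMetadatas[i]?.["someField"] instead.
const pending = embedSparseTarget("sparse", (i) => sampleDocuments[i], sampleMetadatas, toyEmbed);
```

The outer function would then only loop over the sparse targets and pick the right `readSource` for each.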

File: clients/new-js/packages/chromadb/src/collection.ts
Line: 362

@jairad26 jairad26 force-pushed the jai/export-python-hosted-splade-model branch from 517106f to c2c379d Compare October 21, 2025 23:24
@jairad26 jairad26 force-pushed the jai/schema-js-impl branch 3 times, most recently from 8bc1124 to 04d4b3f Compare October 22, 2025 00:39
@jairad26 jairad26 force-pushed the jai/export-python-hosted-splade-model branch from c2c379d to d6914e9 Compare October 22, 2025 00:39
@jairad26 jairad26 changed the base branch from jai/export-python-hosted-splade-model to graphite-base/5621 October 22, 2025 01:11
@jairad26 jairad26 changed the base branch from graphite-base/5621 to main October 22, 2025 01:12
@jairad26 jairad26 enabled auto-merge (squash) October 22, 2025 01:13
Comment on lines +27 to +41
const resolveSchemaEmbeddingFunction = (
  schema: Schema | undefined,
): EmbeddingFunction | undefined => {
  if (!schema) {
    return undefined;
  }

  const embeddingOverride =
    schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config.embeddingFunction ?? undefined;
  if (embeddingOverride) {
    return embeddingOverride;
  }

  return schema.defaults.floatList?.vectorIndex?.config.embeddingFunction ?? undefined;
};

[BestPractice]

The logic in resolveSchemaEmbeddingFunction is duplicated in clients/new-js/packages/chromadb/src/collection.ts (lines 473-487) as getSchemaEmbeddingFunction. To adhere to the DRY principle, this logic should be extracted into a single, shared function. A good location for this would be the new schema.ts file, as it's directly related to schema interpretation and follows ChromaDB's TypeScript architectural patterns.

Additionally, the use of ?? undefined is redundant since optional chaining (?.) already results in undefined if any part of the chain is nullish.

Here's a suggested implementation for a shared function in schema.ts:

// in clients/new-js/packages/chromadb/src/schema.ts
export const resolveSchemaEmbeddingFunction = (
  schema: Schema | undefined,
): EmbeddingFunction | undefined => {
  if (!schema) {
    return undefined;
  }

  return (
    schema.keys[EMBEDDING_KEY]?.floatList?.vectorIndex?.config.embeddingFunction ??
    schema.defaults.floatList?.vectorIndex?.config.embeddingFunction
  );
};

This refactoring aligns with ChromaDB's TypeScript patterns where embedding functions are modular and reusable components. The new function can then be imported and used in both chroma-client.ts and collection.ts, removing the duplicated implementations and improving maintainability.
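
One caveat on the `?? undefined` point: a short-circuited optional chain does evaluate to `undefined`, but if the final property exists and holds `null`, the chain returns `null` and `?? undefined` is what converts it. Whether that case can occur depends on the real type of `embeddingFunction`, which this excerpt does not show. The sketch below uses hypothetical stand-in types (`SketchSchema`, `VectorIndexSketch`) to show the difference.

```typescript
// Stand-in types; here embeddingFunction is allowed to be null on purpose.
type VectorIndexSketch = { config: { embeddingFunction?: string | null } };
type SketchSchema = { vectorIndex?: VectorIndexSketch };

const missingIndex: SketchSchema = {};
const nullFunction: SketchSchema = {
  vectorIndex: { config: { embeddingFunction: null } },
};

// The chain short-circuits at vectorIndex, so the result is undefined.
const fromMissing = missingIndex.vectorIndex?.config.embeddingFunction;
// The property exists and holds null, so null passes through.
const fromNull = nullFunction.vectorIndex?.config.embeddingFunction;
// Only in this case does `?? undefined` change the result.
const normalized = nullFunction.vectorIndex?.config.embeddingFunction ?? undefined;
```

If the configuration type never allows `null`, dropping `?? undefined` is safe. If it does, the suggested one-expression version can return `null` from its second operand, so callers comparing against `undefined` would need a check.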

File: clients/new-js/packages/chromadb/src/chroma-client.ts
Line: 41


// Create copies, converting null to empty object
const updatedMetadatas = metadatas.map((metadata) =>
  metadata !== null && metadata !== undefined ? { ...metadata } : {}

[BestPractice]

This line can be simplified. The check metadata !== null && metadata !== undefined can be shortened to metadata != null, which checks for both null and undefined.

Suggested change
- metadata !== null && metadata !== undefined ? { ...metadata } : {}
+ metadata != null ? { ...metadata } : {}

File: clients/new-js/packages/chromadb/src/collection.ts
Line: 378
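
The `!= null` simplification relies on loose equality treating `null` and `undefined` as equal to each other and to nothing else. A minimal sketch, with `copyMetadatas` and `RecordMetadata` as hypothetical names:

```typescript
type RecordMetadata = Record<string, unknown>;

// Copies each metadata object, substituting {} for null or undefined entries.
function copyMetadatas(
  metadatas: Array<RecordMetadata | null | undefined>,
): RecordMetadata[] {
  // `!= null` is false only for null and undefined.
  return metadatas.map((metadata) => (metadata != null ? { ...metadata } : {}));
}

const sourceMetadata: RecordMetadata = { topic: "schema" };
const copies = copyMetadatas([sourceMetadata, null, undefined, {}]);
```

Each result is a fresh object, so later writes to the copies do not mutate the caller's input.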

Comment on lines +245 to 248
export const getSparseEmbeddingFunction = (
  collectionName: string,
  efConfig?: EmbeddingFunctionConfiguration,
) => {

[BestPractice]

The console.warn calls were removed from this function for cases where efConfig is missing or has type legacy. However, getEmbeddingFunction still includes these warnings. For a better developer experience and consistency, it would be helpful to add the warnings back to getSparseEmbeddingFunction. This would alert developers when a sparse embedding function is configured but cannot be instantiated, which could otherwise lead to silent failures or unexpected behavior.

ChromaDB Context: Both dense and sparse embedding functions in ChromaDB follow similar configuration patterns. Consistent warning messages across both types help developers quickly identify and resolve embedding function configuration issues, especially when working with legacy configurations or missing parameters.
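
A rough sketch of what restoring the warnings might look like follows. Everything here is hypothetical: `resolveSparseEf`, the `SparseEfConfig` shape (including the `"known"` variant), and the warning wording are illustrative placeholders, not the client's actual code or messages. The `warn` parameter is injectable only so the behavior can be observed without touching the console.

```typescript
// Placeholder config shape; the real EmbeddingFunctionConfiguration may differ.
type SparseEfConfig = { type: "legacy" } | { type: "known"; name: string };

// Hypothetical resolver: returns the configured name, or undefined plus a
// warning when nothing can be instantiated.
function resolveSparseEf(
  collectionName: string,
  efConfig: SparseEfConfig | undefined,
  warn: (message: string) => void = console.warn,
): string | undefined {
  if (!efConfig) {
    warn(`No sparse embedding function configuration found for collection ${collectionName}.`);
    return undefined;
  }
  if (efConfig.type === "legacy") {
    warn(`Collection ${collectionName} has a legacy sparse embedding function configuration.`);
    return undefined;
  }
  return efConfig.name;
}

const collectedWarnings: string[] = [];
const legacyResult = resolveSparseEf("docs", { type: "legacy" }, (message) => {
  collectedWarnings.push(message);
});
```

Returning `undefined` without a warning is the silent-failure case the review comment describes; the warning keeps the same return value but makes the cause visible.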

File: clients/new-js/packages/chromadb/src/embedding-function.ts
Line: 248

@jairad26 jairad26 disabled auto-merge October 22, 2025 02:17
@jairad26 jairad26 merged commit 4bf1924 into main Oct 22, 2025
120 of 122 checks passed
3 participants